This tutorial runs through predicting the chromatin interactions of an example region, plotting the predictions as well as the skeleton and the “raw” HiC interactions. It will also demonstrate how to plot 1D tracks such as ChIp-seq signals attached to the deepC plots. Further, the tutorial demonstrates how to make a prediction of the impact of sequence variant, plotting the reference, variant and interactions and a differential plot and also calculate an associated damage score.
You can find the smaller example files in the gitHub repository in ./tutorials. Larger files are linked under formatted_data_links.
Load packages and source custom functions for deepC and general HiC processing and visualization.
library(tidyverse)
library(cowplot)
library(RColorBrewer)
source("../deepC/helper_for_preprocessing_and_analysis/functions_for_HiC.R")
source("../deepC/helper_for_preprocessing_and_analysis/functions_for_deepC.R")
Set global options, specifiying bin size and bp context of the model.
sample <- "GM12878"
bin.size <- 5000
window.size <- 1000000 + bin.size # bp_context
interaction.value.bins <- 10 # how many HiC skeleton bins have been used (10 for all published models)
prediction.bins <- window.size/bin.size # number of predictions bins (also called number of classes later). Refers to number of bins in the vertial interaction pole
# depends on the bin.size and bp_context selected: 201 for 5kb models | 101 for 10k b models
slope <- 1/(bin.size/2) # slope for visualizing interactions with the central bin in later visualizations
We start by predicting the chromatin interactions of a region of chr17 (test chromosome in manuscript). No variants yet just reference sequence.
Have a look at the example regions file we provide the deepC script with. Format: chromosome start end variant in tab separated columns. The format is bed-like so 0-based, half open coordinates! To only predict on the reference sequence use the variant_tag ‘reference’ without quotation. We can run longer patches of sequence like in the example below (recommended up to ~5 Mb). For longer patches it is recommended to split the region of intrest into chunks of 3 - 5 Mb and concatitnate the predictions afterwards.
head example_region_short.bed
## chr17 71000000 72000000 reference
Now we run the predictions over the example regions. It is worth running this command in a shell script for nohup or queue on a cluster as this is the labour intensive step. I usually run the python script first and then move to analysis and visualization within R.
# ! in bash
python ../deepC/current_version/run_deploy_shape_deepCregr.py --input example_region_short.bed \ # input deepC variant bed-like file
--out_dir ./test_predict_out \ #output directory
--name_tag predict \ # name tag to add to ouput files
--model ../example_data/models/GM12878_primary_5kb/model \ # trained deepC model
--genome ~/Data/reference_sequence/hg19.fa \ # link to whole genome or chromosome wise fasta file (needs a fasta index)
--bp_context 1005000 \ # bp context (1 Mb + bin.size)
--add_window 500000 \ # how much bp to add to either side of the specified window
--num_classes 201 \ # The number of classes corresponds to the number of outputs (output bins of the vertical pole) (201 for 5kb models; 101 for 10kb models)
--bin_size 5000 \ # bin size matching to the model and input data selected
--run_on gpu # specify to run on gpu or cpu (cpu takes significantly longer)
Having a look at the output format. The variant file (also called so for reference predictions) is a tab separated file: chr start end pred_bin_1 pred_bin_2 ….
The coordinates specify the sequence window (1Mb + bin size) centered on this are the vertical pole interaction predictions moving from close to diagonal to distal interactions. The coordinates refer to the center of a bin.sized bin. Get the center bin the predictions map to by calculating the center of this window. Subtract bin.size/2 to get left most base pair bin specification.
The predictions are regression values corresponding to the HiC skeleton. The header files list the reference or variant region and where they map to in the reference genome. They also specify if any bp shifts occured due to differences between reference and variant length. This is obviously not relevant for reference predictions. We can concatinate multiple reference predictions files to combine bigger regions or whole chromosome predictions in a single file. To do this just concatinate, remove header lines (grep -v “#”) and sort (if not concatinated in order).
head test_predict_out/class_predictions_predict_short_1_chr17_71000000_71999999.txt | cut -f 1,2,3,4,5,6
## # Variant Queried: chr17 71000000 72000000 reference
## # Mapping to relative reference coordinates: chr17 70500000 72500000
## # Bp to adjust after Variant: 0
## chr17 69997500 71002500 3.9926 4.0854 4.0988
## chr17 70002500 71007500 3.9043 4.0909 4.1212
## chr17 70007500 71012500 3.7195 4.0692 4.0738
## chr17 70012500 71017500 3.7844 4.208 4.0911
## chr17 70017500 71022500 3.5739 4.0021 3.8372
## chr17 70022500 71027500 3.3619 3.9075 3.5272
## chr17 70027500 71032500 3.1335 3.7965 3.2948
Read the reference prediction file. Use the readDeepcVariantFile function. If reading in a concatinated variant file without header use the readDeepcVariantFileNoHeader
ref.pred <- readDeepcVariantFile("./test_predict_out/class_predictions_predict_short_1_chr17_71000000_71999999.txt",
prediction.bins = prediction.bins,
bin.size = bin.size,
gather = T) # specify gather TRUE or FALSE if to gather the prediction data in a long format for ggplot2 etc. FALSE: broad format
# print structure
str(ref.pred)
## List of 8
## $ df :Classes 'tbl_df', 'tbl' and 'data.frame': 80601 obs. of 4 variables:
## ..$ chr : Factor w/ 1 level "chr17": 1 1 1 1 1 1 1 1 1 1 ...
## ..$ pos : num [1:80601] 70502500 70507500 70512500 70517500 70522500 ...
## ..$ bin : num [1:80601] 1 1 1 1 1 1 1 1 1 1 ...
## ..$ value: num [1:80601] 3.99 3.9 3.72 3.78 3.57 ...
## $ variant.chrom : chr "chr17"
## $ variant.start : num 7.1e+07
## $ variant.end : num 7.2e+07
## $ relative.chrom: chr "chr17"
## $ relative.start: num 70500000
## $ relative.end : num 72500000
## $ bp.adjust : num 0
The deepC variant file is stored in a list which captures the actual predictions in a data frame and the genomic coordinates from variant and relative mapping (same in reference case) and the bp shift. To view the predictions access:
ref.pred$df
## # A tibble: 80,601 x 4
## chr pos bin value
## <fct> <dbl> <dbl> <dbl>
## 1 chr17 70502500. 1. 3.99
## 2 chr17 70507500. 1. 3.90
## 3 chr17 70512500. 1. 3.72
## 4 chr17 70517500. 1. 3.78
## 5 chr17 70522500. 1. 3.57
## 6 chr17 70527500. 1. 3.36
## 7 chr17 70532500. 1. 3.13
## 8 chr17 70537500. 1. 2.86
## 9 chr17 70542500. 1. 2.55
## 10 chr17 70547500. 1. 2.53
## # ... with 80,591 more rows
For plotting HiC data in triangular format (half of the matrix spanning the horizontally layed out genome) in ggplot2 we implemented a function that transforms the interaction values into a diamond shape (4 polygon points). We then plot them as the area between the points. Note this function yields really large vector graphics so only save as raster images (png/jpeg).
# provide the interaction data frame and the bin.size
t.rdf <- triangularize(ref.pred$df, bin = bin.size)
We set the limits for plotting and plot the predicted interactions
# set limits for plotting
limit1 <- ref.pred$variant.start - 500000
limit2 <- ref.pred$variant.end + 500000
# create plot
plot.rdf <- t.rdf %>%
ggplot(aes(x = pos, y = (bin*bin.size)/1000, fill = value, group = polyid)) +
geom_polygon() +
labs(y = "Genomic Distance [kb]") +
scale_fill_gradientn(colours = brewer.pal(9, 'YlOrRd')) +
coord_cartesian(xlim=c(limit1, limit2))
plot(plot.rdf)
The plot shows the predicted (skeleton) interaction value with genomic position on the x-axis and the genomic distance between the elements interacting on the y-axis.
Now lets overlap this with the actual Hi-C skeleton. Here we read in the already formated HiC skeleton data (example download links under formated_data_links). Check out the tutorial of how to generate your skeleton data from HiC data to reproduce or create your own. The format is tab separated: chr start end skeleton_interaction_values. The coordinates refer to the bp_context sized window and the interaction values are centered on this window. The interactions are comma separated corrseponding to the prediction bins from close to diagonal to distal interactions.
head ./example_skeleton_gm12878_5kb_chr17.bed
## chr17 0 1005000 10,10,9,6,1,9,6,9,7,6,9,5,6,6,4,10,9,4,4,6,7,7,8,10,6,9,9,10,7,7,10,5,9,10,10,5,10,10,9,10,10,10,10,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,7,2,9,5,8,9,6,10,9,10,10,6,9,10,8,3,10,10,10,6,5,10,8,9,10,10,9,6,8,8,10,10,10,8,9,8,8,10,9,9,8,10,6,10,10,8,10,7,10,10,8,6,10,10,10,10,10,8,3,10,10,5,9,10,10,10,8,9,10,10,10,10,8,10,10,9,10,6,6,10,9,9,1,9,8,8,9,6,10,9,5,9,10,1,10,7,7,10,7,10,10,10,10,8,5,10,1,8,10,10,6,10,5,8,5,10,10,1,5,5
## chr17 5000 1010000 8,10,6,2,1,6,9,3,3,6,9,4,9,6,2,10,7,7,6,8,9,7,5,4,10,10,9,9,10,6,6,10,8,10,10,10,9,10,10,4,9,10,10,10,10,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,8,6,5,9,10,10,10,8,10,10,10,6,7,10,9,5,7,10,10,6,4,6,10,9,10,9,10,10,10,10,8,10,10,1,6,6,5,10,10,9,10,10,10,7,4,10,10,10,9,9,10,9,10,7,8,10,10,6,10,10,10,10,3,10,10,10,10,10,9,6,9,10,10,8,8,9,9,10,6,1,5,10,10,10,10,10,9,5,7,1,1,1,7,1,10,10,6,10,9,10,10,10,10,10,8,3,10,6,10,9,8,9,7,6,8,3,8,9
## chr17 10000 1015000 9,10,1,3,2,10,9,10,8,10,4,7,4,7,4,5,4,8,10,5,9,7,10,8,5,8,6,6,10,9,10,9,9,10,8,8,9,10,9,9,9,10,10,10,10,4,9,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,6,5,9,4,9,9,10,8,10,9,6,9,9,2,6,2,8,8,6,8,10,9,8,10,10,4,4,8,10,1,10,10,9,4,10,9,10,10,10,10,10,8,7,10,10,10,10,10,10,10,8,10,10,10,8,7,9,10,10,10,9,10,10,9,10,9,10,9,10,9,10,5,9,4,10,1,1,1,8,10,6,6,4,10,10,10,5,10,1,10,7,1,10,9,7,3,10,9,6,9,7,8,1,9,6,6,5,5,10,6,7,6,8,5,8,8
## chr17 15000 1020000 10,1,2,6,8,8,2,5,1,7,3,10,6,4,6,8,7,7,8,8,9,9,10,9,7,10,7,6,8,7,8,10,7,9,9,10,9,10,10,9,7,9,5,9,9,6,9,10,10,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,9,3,6,9,6,8,9,8,6,2,6,6,6,8,6,5,10,9,3,6,10,10,7,8,10,9,1,10,4,10,8,10,10,10,10,6,7,10,10,9,7,3,10,4,10,9,10,10,10,10,10,9,10,1,9,10,10,10,10,9,6,10,9,7,10,10,1,8,7,8,1,10,4,10,10,9,7,10,10,9,10,10,10,10,3,6,5,7,8,1,1,10,10,1,9,10,10,10,9,7,10,3,9,10,3,4,10,1,7,6,6,8,5,5
## chr17 20000 1025000 10,7,7,8,6,7,8,2,1,10,10,10,9,7,9,8,7,9,9,6,9,8,8,2,5,9,9,9,6,7,9,10,9,9,6,9,7,7,6,9,2,9,7,9,4,5,5,9,9,10,9,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,5,8,5,6,8,6,8,8,8,10,9,10,7,3,9,10,10,5,10,9,8,6,10,8,8,10,9,9,9,10,9,10,8,10,10,6,8,10,10,5,10,9,10,10,10,9,10,9,10,8,10,9,8,9,10,10,10,10,6,10,3,10,8,5,6,4,9,10,9,9,10,10,9,7,9,10,10,1,6,10,10,5,9,8,6,8,10,9,10,1,10,5,10,9,7,3,1,7,5,5,6,5,10,1,7,8,1,5,9,6,6,6
## chr17 25000 1030000 8,2,6,9,2,7,2,8,5,6,6,10,8,6,6,5,7,9,7,6,9,9,5,10,6,10,7,8,9,10,9,7,10,10,7,8,9,7,6,4,6,8,6,9,6,8,7,6,10,8,8,6,6,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,9,10,7,8,9,6,4,6,10,6,10,10,3,10,6,3,5,10,9,5,8,5,10,6,6,9,10,5,10,10,10,5,10,6,9,10,10,10,10,10,10,6,10,9,7,10,10,10,10,9,9,10,9,10,10,10,10,9,9,6,1,9,10,10,7,9,8,8,7,9,10,9,7,10,9,8,10,7,10,6,1,10,10,9,1,6,10,10,10,10,9,7,9,7,6,9,9,8,7,8,8,10,4,5,5,6,7,4
## chr17 30000 1035000 7,5,8,1,2,3,3,6,4,9,6,5,6,8,7,8,8,2,8,7,5,6,5,9,6,10,7,9,9,10,5,6,9,3,8,5,5,7,6,7,7,4,6,8,6,5,7,2,8,2,6,10,10,10,7,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,10,7,9,6,8,7,9,10,1,7,7,8,9,3,4,7,10,9,9,8,10,10,10,10,8,9,3,9,5,9,10,9,10,6,7,8,10,10,10,10,10,8,7,10,9,9,9,10,7,9,10,10,10,1,6,10,10,4,1,10,1,10,1,9,6,10,9,8,10,6,10,10,7,10,5,10,10,6,5,5,8,10,1,4,10,9,10,8,10,9,9,1,8,1,3,8,4,8,10,3,8,5,1,1,1,9,1,5
## chr17 35000 1040000 10,10,6,9,6,1,4,2,2,2,5,9,8,4,8,9,9,2,7,6,8,8,7,4,9,9,6,8,6,6,9,7,9,6,4,5,4,2,7,4,8,3,9,6,8,8,4,3,7,8,8,7,6,5,5,8,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,6,8,8,9,8,9,6,10,5,5,9,9,6,2,5,5,8,10,9,9,9,10,10,8,4,6,7,9,10,10,10,8,8,6,10,10,10,10,6,9,10,9,10,10,10,9,9,9,10,3,10,9,1,10,8,9,8,5,9,9,8,10,5,10,10,10,9,9,10,10,1,6,5,10,6,10,10,1,1,8,3,3,9,9,8,8,4,3,10,1,1,5,10,6,5,5,9,1,6,3,5,3,10,1,6,7
## chr17 40000 1045000 6,1,4,3,8,6,6,2,6,4,7,7,6,10,4,6,7,3,1,6,7,6,5,9,9,9,4,5,4,6,4,8,7,2,3,1,6,7,5,4,3,4,5,3,7,4,6,4,3,8,8,9,2,7,3,5,8,8,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,6,5,7,5,10,10,10,2,10,7,6,3,8,4,4,8,6,2,10,9,10,10,10,9,1,6,8,10,10,10,10,4,1,9,10,10,9,10,9,9,7,10,9,7,10,9,9,10,10,10,10,1,9,9,10,9,7,10,8,3,1,1,6,9,7,8,10,7,7,7,9,8,10,10,7,5,9,5,10,8,8,7,4,1,9,7,4,10,1,5,9,7,8,6,3,5,1,6,7,6,8,1,1
## chr17 45000 1050000 10,10,7,5,4,4,8,6,5,2,8,8,10,5,3,2,10,7,3,4,6,2,3,9,7,8,5,8,3,3,3,8,5,3,7,3,3,5,6,6,2,4,6,6,9,7,4,9,2,4,9,7,5,6,6,4,4,7,10,3,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,10,1,6,8,9,10,10,10,10,3,8,6,9,7,7,6,8,9,9,7,8,10,9,10,9,3,10,10,10,10,10,10,10,5,6,8,10,10,9,10,8,4,10,7,7,6,9,1,10,10,10,9,5,6,7,9,10,9,9,8,8,9,7,6,8,7,8,4,10,9,1,6,6,5,5,8,10,9,8,4,6,4,5,7,6,4,7,4,1,1,1,1,8,8,6,5,4,4,5,10,7,9
skel.file <- "./example_skeleton_gm12878_5kb_chr17.bed"
# function can be used for HiC or Skeleton values as long as they are collated in a comma separated single column
skel.df <- readDeepcInputHicBed(skel.file, prediction.bins = prediction.bins, bin.size = bin.size, gather=TRUE)
head(skel.df)
## # A tibble: 6 x 4
## chr pos bin value
## <fct> <dbl> <dbl> <dbl>
## 1 chr17 502500. 1. 10.
## 2 chr17 507500. 1. 8.
## 3 chr17 512500. 1. 9.
## 4 chr17 517500. 1. 10.
## 5 chr17 522500. 1. 10.
## 6 chr17 527500. 1. 8.
Triangularize the skeleton and plot the overlay.
# set limits to what is covered in the reference prediction file
limit1 <- min(ref.pred$df$pos)
limit2 <- max(ref.pred$df$pos)
# triangularize --> format to diamond shape polygons
t.skel.df <- triangularize(skel.df, bin.size)
# create the plot
plot.skel.df <- t.skel.df %>%
ggplot(aes(x = pos, y = (bin*bin.size)/1000, fill = value, group = polyid)) +
geom_polygon() +
labs(y = "Genomic Distance [kb]") +
scale_fill_gradientn(colours = brewer.pal(9, 'YlOrRd')) +
xlim(limit1, limit2)
# plot using cowplot and helper themes for merging.
plot_grid(plot.skel.df + upper_overlay_theme,
plot.rdf + lower_overlay_theme,
nrow = 2, align = "v")
To add the source HiC data we read in sparse matrix files, either in HiC-Pro format (bins with bin ids and interaction values) and a bed file specifying the genomic coordinates of the bin ids or a matrix with genomic coordinates instead of bin ids e.g. as downloaded from Rao et al. (GSE63525) use the intrachromsosomal interaction matrices and apply norm factors as desired (here KRnorm). Examples linked in formatted_data_links. Note this format lists the left most base pair as identifier for the respective bin.
head gm12878_primary_chr17_5kb.contacts.KRnorm.matrix
## 0 0 1092.39634212411
## 0 5000 272.056324049362
## 5000 5000 707.645865108206
## 0 10000 113.901873171691
## 5000 10000 202.53587051085
## 10000 10000 817.905220550323
## 0 15000 104.228608910218
## 5000 15000 120.956194934016
## 10000 15000 261.414278868473
## 15000 15000 1263.85448972055
Here is how we read in HiC data, process and plot them.
# single chromosome intra chromosomal interaction matrix
matrix.norm.loc <- "./gm12878_primary_chr17_5kb.contacts.KRnorm.matrix"
# to read in a HiC matrix with the format coord_bin1 coord_bin2 interaction_value (e.g. as Rao et al. downloads)
# provide the matrix file, the chromosome and bin.size arguments
# if the format is like HiC-Pro (bin1_id bin2_id interaction_value) with a separate bed-like coordinate file
# provide the coordinate file in the "coords=" argument of the Import function
hic <- ImportHicproMatrix(matrix.norm.loc, chr="chr17", bin.size=bin.size) # import
hic <- trimHicCoords(hic, start = limit1-window.size, end = limit2+window.size) # remove HiC contacts that are not withon our region of interest
hic <- trimHicRange(hic, range = window.size + bin.size) # remove all interactions that are more then bp_context apart
# check the first covered position in the HiC data
start.pos <- hic$start.pos - bin.size/2
# to map the hic data into consistent bins for deepC training, prediction and plotting we create get a binned genome template
# this requires bedtools to be installed as the R function calls bedtools in the background in bash
# add at least the bp_context window.size to the region of interest coordinates
binned.genome <- getBinnedChrom(chr="chr17", start=limit1-window.size, end=limit2+window.size, window=window.size, step=bin.size)
## [1] "bedtools makewindows -b tempfile_for_bedtools_call_2019-11-20_15:12:16_0.900164404418319.bed -w 1005000 -s 5000 >tempfile_from_bedtools_call_2019-11-20_15:12:16_0.900164404418319.bed"
binned.genome <- as.tibble(binned.genome)
names(binned.genome) <- c("chr", "start", "end")
# inspect
binned.genome
## # A tibble: 804 x 3
## chr start end
## <chr> <dbl> <dbl>
## 1 chr17 69495000. 70500000.
## 2 chr17 69500000. 70505000.
## 3 chr17 69505000. 70510000.
## 4 chr17 69510000. 70515000.
## 5 chr17 69515000. 70520000.
## 6 chr17 69520000. 70525000.
## 7 chr17 69525000. 70530000.
## 8 chr17 69530000. 70535000.
## 9 chr17 69535000. 70540000.
## 10 chr17 69540000. 70545000.
## # ... with 794 more rows
# filter out incomplete megabase bins
binned.genome <- binned.genome %>%
mutate(width = end - start) %>%
filter(width == window.size) %>%
select(-width)
# # # if necessary add sub telomeric regions to binned genome
# to.add <- tibble(chr = chrom, start = seq(start.pos - window.size/2, start.pos - bin.size, bin.size))
# to.add <- to.add %>% mutate(end = start + window.size)
# binned.genome <- rbind(to.add, binned.genome)
# format HiC data into deepC vertical pole distance bin data frame format
# link to the helper perl script in your deepC repository copy
hic.df <- getZigZagWindowInteractionsPerl(hic, binned.genome, window.size, bin.size, query.pl = "../deepC/helper_for_preprocessing_and_analysis/match_query_table.pl")
## [1] "saved temp_hic_file"
## [1] "saved temp_query_file"
## [1] "matching ..."
hic.df <- as.tibble(hic.df)
# to visualize hic data we add a position column and gather
hic.df <- hic.df %>%
mutate(pos = start + (end - start)/2) %>%
select(chr, pos, c(4:(prediction.bins+3))) %>%
gather(bin, value, -chr, -pos) %>%
mutate(bin = as.numeric(bin)) %>%
mutate(pos = if_else(bin %% 2 == 0, pos - bin.size/2, pos)) %>% # this indents the even distance bin positions to get the zig-zag layout
filter(value > 0)
# To visualize HiC data we susually quantile squeeze the very low and very high interaction values to increase contrast in the middle value range
hic.df$value <- SetValueRange(hic.df$value, min=as.numeric(quantile(hic.df$value, .05)), max=as.numeric(quantile(hic.df$value, .95)))
Now plot the HiC data as overlap.
# triangularize
t.hic.df <- triangularize(hic.df, bin.size)
plot.hic.df <- ggplot(t.hic.df, aes(x = pos, y = (bin*bin.size)/1000, fill = value, group = polyid)) +
geom_polygon() +
labs(y = "Genomic Distance [kb]") +
scale_fill_gradientn(colours = brewer.pal(9, 'YlOrRd')) +
xlim(limit1, limit2) +
theme(strip.background = element_blank())
# plot using cowplot and helper themes for merging.
plot_grid(plot.hic.df + upper_overlay_theme,
plot.skel.df + upper_overlay_theme,
plot.rdf + lower_overlay_theme,
nrow = 3, align = "v")
We can also overlay 1D signals from bigwig files such as CTCF ChIP-seq or DNase-seq tracks. Example tracks for GM12878 in 50 bp windows are downloadable from formated_data_links or just checkout ENCODE or your ressources of choice.
# we require some more libraries for this
library(rtracklayer)
# library(GenomicRanges)
bw.ctcf <- "./ctcf_gm12878_encode_broad_merged_w50.bw"
bw.dnase <- "./dnase_gm12878_encode_merged_w50.bw"
# initialize GRanges object
groi <- makeGRangesFromDataFrame(tibble(chrom = "chr17", start = limit1, end = limit2))
# import bgigwig signal over region of interest
gr.ctcf <- import(bw.ctcf, which = groi) # Import BigWig Signal
df.ctcf <- as.tibble(gr.ctcf)
## Warning in as.data.frame(mcols(x), ...): Arguments in '...' ignored
df.ctcf <- df.ctcf %>%
mutate(pos = start + (end - start)/2) %>%
select(c(pos, width, score))
gr.dnase <- import(bw.dnase, which = groi) # Import BigWig Signal
df.dnase <- as.tibble(gr.dnase)
## Warning in as.data.frame(mcols(x), ...): Arguments in '...' ignored
df.dnase <- df.dnase %>%
mutate(pos = start + (end - start)/2) %>%
select(c(pos, width, score))
# combine in data frame
df.tracks <- rbind(df.ctcf %>% mutate(source = "CTCF"),
df.dnase %>% mutate(source = "DNase"))
# plot 1D signal
p.tracks <- df.tracks %>%
ggplot(aes(x = pos, y = score, col = source)) +
geom_area() +
scale_color_brewer(palette = "Set1") +
xlim(limit1, limit2) +
facet_wrap(~source, nrow = 2, strip.position = "right") +
theme(strip.background = element_blank(), legend.position = "None")
plot.skel.pred.tracks <- plot_grid(plot.hic.df + upper_overlay_theme,
plot.skel.df + upper_overlay_theme,
plot.rdf + lower_overlay_theme,
p.tracks + theme(
axis.line.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
axis.title.y = element_blank()
),
nrow = 4, align = "v", axis = "tlbr", rel_heights = c(1,1,1,.7))
plot(plot.skel.pred.tracks)
Now lets delete a CTCF site. To delete a sequence supply ‘.’ in the variant tag (without quoatation) e.g.
head example_variant.bed
## chr17 71706322 71706672 .
To replace a sequence chunk (or single base pair) with a sequence variant just supply the variant sequence in the variant_tag column. The length of the sequences don’t have to match. Note that when the reference and variant sequence have unequal length, the sequence window in focus is shifted according to the difference which can lead to predicted changes shadow (see manuscript supplementary figures). You might want to consider supplementing N’s instead of deleting bases.
We go ahead and delete the example variant sequence as provided. We run the deepC predictor again supplying the variant deepC bed file. We specify to predict over an area of half the sequence window around the variant.
# ! in bash
python ../deepC/current_version/run_deploy_shape_deepCregr.py --input example_variant.bed \ # evariant deepC bed-like file
--out_dir ./test_variant_out \ # output directory
--name_tag predict_variant \
--model ../example_data/models/GM12878_primary_5kb/model \
--genome ~/Data/reference_sequence/hg19.fa \
--bp_context 1005000 \
--add_window 500000 \ # window around variant to predict
--num_classes 201 \
--bin_size 5000 \
--run_on gpu
First lets have a quick look at the variant prediction output. Format as with the reference file before but now the bp to adjust are non-zero because we deleted 350 bps.
head test_variant_out/class_predictions_predict_variant_1_chr17_71706322_71706671.txt | cut -f 1,2,3,4,5
## # Variant Queried: chr17 71706322 71706672 .
## # Mapping to relative reference coordinates: chr17 71205000 72210000
## # Bp to adjust after Variant: 350
## chr17 70702500 71707500 3.3787 4.5646
## chr17 70707500 71712500 3.5671 4.4369
## chr17 70712500 71717500 3.5982 4.6421
## chr17 70717500 71722500 3.5043 4.4437
## chr17 70722500 71727500 3.4526 4.4135
## chr17 70727500 71732500 3.3735 4.0503
## chr17 70732500 71737500 3.3697 4.0881
We read in the variant prediction data.
# load variant predictions
var.loc <- "test_variant_out/class_predictions_predict_variant_1_chr17_71706322_71706671.txt"
var <- readDeepcVariantFile(var.loc, prediction.bins = prediction.bins, bin.size = bin.size)
And plot the overlap between reference and variant predictions as well as a substraction plot
# filter the reference prediction for only those positions that are also covered (affected by) the variant prediction
ref.df2 <- ref.pred$df %>%
filter(pos > 0) %>%
filter(pos >= min(var$df$pos)) %>%
filter(pos <= max(var$df$pos))
# Make a new reference plot
p.ref <- triangularize(ref.df2, bin.size) %>%
ggplot(aes(x = pos, y = (bin*bin.size)/1000, fill = value, group=polyid)) +
geom_polygon() +
geom_vline(xintercept = var$variant.start, linetype = "dotted") +
labs(y = "Genomic Distance [kb]", x = "chr17") +
ggtitle("Reference") +
# scale_x_continuous(breaks=c(71400000, 71700000, 72000000)) +
scale_fill_gradientn(colours = brewer.pal(9, 'YlOrRd')) +
theme(strip.background = element_blank())
# make a variant
p.var <- triangularize(var$df, bin.size) %>%
ggplot(aes(x = pos, y = (bin*bin.size)/1000, fill = value, group=polyid)) +
geom_polygon() +
labs(y = "Genomic Distance [kb]", x = "chr17") +
geom_vline(xintercept = var$variant.start, linetype = "dotted") +
# scale_x_continuous(breaks=c(71400000, 71700000, 72000000)) +
ggtitle("Variant") +
scale_fill_gradientn(colours = brewer.pal(9, 'YlOrRd')) +
theme(strip.background = element_blank())
# Make a differential plot
p.diff <- plotDiffDeepC(ref.df2, var$df, bin.size = 5000) +
geom_vline(xintercept = var$variant.start, linetype = "dotted") +
geom_abline(slope = 1/(bin.size/(2*bin.size/1000)), intercept = -var$variant.start/(bin.size/(2*bin.size/1000)), linetype = "dashed") +
geom_abline(slope = -1/(bin.size/(2*bin.size/1000)), intercept = var$variant.start/(bin.size/(2*bin.size/1000)), linetype = "dashed") +
ggtitle(paste0("Var - Ref"))
# Make the combined plot
# adding the CTCF and DNase track from earlier as well
plot.variant.combined <- plot_grid(p.ref + upper_overlay_theme,
p.var + upper_overlay_theme,
p.diff + upper_overlay_theme,
p.tracks + xlim(c(min(var$df$pos), max(var$df$pos))) + theme(
axis.line.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.y = element_blank(),
axis.title.y = element_blank()
),
nrow = 4, align = "v", axis = "tlbr", rel_heights = c(1,1,1,.8))
## Warning: Removed 21122 rows containing missing values (position_stack).
plot(plot.variant.combined)
Note that the CTCF and DNase tracks are still mapped to the reference. For larger deletions we would want to shift all data past the deletion by the bp_difference. For smaller differences the effect is however neglectable.
To get a single difference score we calculate the absolute difference (var - ref) normalized for the total number of bin interactions affected.
# calculate the total number of HiC interaction tiles (bin positions) affected by the variant
# used for normalizing the damage sscore
covered.tiles <- dim(var$df)[1]/prediction.bins * (window.size/bin.size)
# calculate the differntial data frame
diff.df <- getDiffDeepC(ref.df2, var$df)
# calculate the sum of the absolut difference and normalize for the number of covered tiles
avg.abs.diff <- sum(abs(diff.df$diff)) / covered.tiles
print(avg.abs.diff)
## [1] 0.6106955